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Sparse Multinomial Logistic Regression via 
Approximate Message Passing 

Evan Byrne and Philip Schniter* 


Abstract —For the problem of multi-class linear classification 
and feature selection, we propose approximate message passing 
approaches to sparse mnltinomial logistic regression (MLR). 
First, we propose two algorithms based on the Flybrid Gen¬ 
eralized Approximate Message Passing (HyGAMP) framework: 
one finds the maximum a posteriori (MAP) linear classifier and 
the other finds an approximation of the test-error-rate minimiz¬ 
ing linear classifier. Then we design computationally simplified 
variants of these two algorithms. Next, we detail methods to 
tune the hyperparameters of their assumed statistical models 
using Stein’s unbiased risk estimate (SURE) and expectation- 
maximization (EM), respectively. Einally, using both synthetic 
and real-world datasets, we demonstrate improved error-rate 
and rnntime performance relative to existing state-of-the-art 
approaches to sparse MLR. 

Index Terms —Classification, feature selection, belief propaga¬ 
tion, message passing. 


I. Introduction 

A. Objective 

We consider the problems of multiclass (or polytomous) 
linear classification and feature selection. In both problems, 
one is given training data of the form {{i/m, o,m)}m=i’ where 
G is a vector of features and G D} is the 

corresponding D-ary class label. In multiclass classification, 
the goal is to infer the unknown label t/o associated with a 
newly observed feature vector Oq. In the linear approach to 
this problem, the training data are used to design a weight 
matrix X G R^^^ that generates a vector of “scores” zq = 
G R^, the largest of which can be used to predict the 
unknown label, i.e., 

yo = argmax[zo]d- (1) 


In the N ^ M case, accurate linear classification and 
feature selection may be possible if the labels are influenced 
by a sufficiently small number, K, of the total N features. 
For example, in binary linear classification, performance guar¬ 
antees are possible with only M = 0{K log N/K) training 
examples when is i.i.d. Gaussian ||8l. Note that, when 
K N, accurate linear classification can be accomplished 
using a sparse weight matrix X, i.e., a matrix where all but 
a few rows are zero-valued. 


B. Multinomial logistic regression 

For multiclass linear classification and feature selection, 
we focus on the approach known as multinomial logis¬ 
tic regression (MLR) a, which can be described using a 
generative probabilistic model. Here, the label vector y — 
[yo,..., j/m]^ is modeled as a realization of a randoirQ vector 
y — [Voi ■ • ■! Vm]^’ “true” weight matrix X is modeled 
as a realization of a random matrix X, and the features 
A = [ao,...,aM]^ are treated as deterministic. Moreover, 
the labels y^ are modeled as conditionally independent given 
the scores am, i-e., 

M 

Pr{y = y|X = X;A}= py|,(y,„|xVn), (2) 

m—1 


and distributed according to the multinomial logistic (or soft- 
max) pmf; 


/ I _ exp([z„,]y^) 

P\/\z\ym\Zm) — jj ■ 

^d=l 6Xp([ZmJd) 


yme {1,...,D}. (3) 


The rows of the weight matrix X are then modeled as i.i.d.. 


In feature selection, the goal is to determine which subset of 
the N features oq is needed to accurately predict the label yo- 
We are particularly interested in the setting where the 
number of features, N, is large and greatly exceeds the number 
of training examples, M. Such problems arise in a number of 
important applications, such as micro-array gene expression 
HD, multi-voxel pattern analysis (MVPA) (ED, text mining 
I5l6l , and analysis of marketing data Q. 
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N 

Fx(X) = ]^Px(a:n), (4) 

n—1 

where px may be chosen to promote sparsity. 

C. Existing methods 

Several sparsity-promoting MLR algorithms have been pro¬ 
posed (e.g., OlOll 111211311411511 ). differing in their choice of 
Px and methodology of estimating X. For example, II11I12I13I 
use the i.i.d. Laplacian prior 

Px{Xn] n 2 (5) 

d^l 

^For clarity, we typeset random quantities in sans-serif font and detemiin- 
istic quantities in serif font. 
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with A tuned via cross-validation. To circumvent this tuning 
problem, HI employs the Laplacian scale mixture 


D 


PxiXn) = P[ 


- exp(-A|a;„c;|) 


p(A) dA, 


( 6 ) 


with Jeffrey’s non-informative hyperprior p{X) oc j1a>o. 
The relevance vector machine (RVM) approach Ho) uses the 
Gaussian scale mixture 


D 

Px{Xn) = 11 / Xf{Xnd;0,t^)p{t^)dl', 
d=l 


(7) 


with inverse-gamma p{i') (i.e., the conjugate hyperprior), 
resulting in an i.i.d. student’s t distribution for px- However, 
other choices are possible. For example, the exponential hy¬ 
perprior p(i/; A) = ^ exp(—would lead back to the 
i.i.d. Laplacian distribution Q for px Hd)- Finally, IfTSl uses 


Px{xn\ A) (X exp(-A||£c„|| 2 ), (8) 


which encourages row-sparsity in X. 

Once the probabilistic model (HI)-® has been specified, 
a procedure is needed to infer the weights X from the 
training data {(t/m, The Laplacian-prior methods 

011I12I13I15II use the maximum a posteriori (MAP) estimation 
framework: 

X = argimxlogp(X|y; A) (9) 

M N 

= argmax E 10gPy|z(y77i |dm ) + ^\ogpx{x„), (10) 

m—1 n=l 

where Bayes’ rule was used for (fTOl) . Under px from (|5]l or 
the second term in (fTOl) reduces to —ll^Jnlli or 
~^J2n=i respectively. In this case, (fTO is concave 

and can be maximized in polynomial time; 0111121131151 
employ (block) coordinate ascent for this purpose. The papers 
cni and IH handle the scale-mixture priors (|6]) and Q, 
respectively, using the evidence maximization framework iflTl . 
This approach yields a double-loop procedure: the hyperpa¬ 
rameter A or is estimated in the outer loop, and—for fixed 
A or Id —the resulting concave (i.e., £2 or £i regularized) MAP 
optimization is solved in the inner loop. 

The methods 01011 111^131141151 described above all yield 
a sparse point estimate X. Thus, feature selection is accom¬ 
plished by examining the row-support of X and classification 
is accomplished through O- 


D. Contributions 

In Section HI] we propose new approaches to sparse-weight 
MLR based on the hybrid generalized approximate message 
passing (HyGAMP) framework from ([TQ. HyGAMP offers 
tractable approximations of the sum-product and min-sum 
message passing algorithms Km by leveraging results of the 
central limit theorem that hold in the large-system limit: 
limTv M^oo with fixed N/M. Without approximation, both the 
sum-product algorithm (SPA) and min-sum algorithm (MSA) 
are intractable due to the forms of pyj^ and px in our problem. 


For context, we note that HyGAMP is a generalization of the 
original GAMP approach from ll20l . which cannot be directly 
applied to the MLR problem because the likelihood function 
© is not separable, i.e., Py\z{ym\zm) ^ ]\dPiym\Zmd)- 
GAMP can, however, be applied to binary classification and 
feature selection, as in ET\ . Meanwhile, GAMP is itself a 
generalization of the original AMP approach from 1221231 . 
which requires py|z to be both separable and Gaussian. 

With the HyGAMP algorithm from ifTSl . message passing 
for sparse-weight MLR reduces to an iterative update of 
0{M + N) multivariate Gaussian pdfs, each of dimension 
D. Although HyGAMP makes MLR tractable, it is still not 
computationally practical for the large values of M and N in 
contemporary applications (e.g., N ~ 10^ to 10® in genomics 
and MVPA). Similarly, the non-conjugate variational message 
passing technique from ll24l requires the update of 0{MN) 
multivariate Gaussian pdfs of dimension D, which is even less 
practical for large M and N. 

Thus, in Section |ni| we propose a simplified HyGAMP 
(SHyGAMP) algorithm for MLR that approximates 
HyGAMP’s mean and variance computations in an efficient 
manner. In particular, we investigate approaches based on 
numerical integration, importance sampling, Taylor-series 
approximation, and a novel Gaussian-mixture approximation, 
and we conduct numerical experiments that suggest the 
superiority of the latter. 

In Section |IV| we detail two approaches to tune the hy¬ 
perparameters that control the statistical models assumed by 
SHyGAMP, one based on the expectation-maximization (EM) 
methodology from ll25l and the other based on a variation of 
the Stein’s unbiased risk estimate (SURE) methodology from 
ll26l . We also give numerical evidence that these methods yield 
near-optimal hyperparameter estimates. 

Einally, in Section |V| we compare our proposed SHyGAMP 
methods to the state-of-the-art MLR approaches mu on 
both synthetic and practical real-world problems. Our experi¬ 
ments suggest that our proposed methods offer simultaneous 
improvements in classification error rate and runtime. 

Notation: Random quantities are typeset in sans-serif (e.g., 
x) while deterministic quantities are typeset in serif (e.g., x). 
The pdf of random variable X under deterministic parameters 6 
is written as Px{x\ 6), where the subscript and parameterization 
are sometimes omitted for brevity. Column vectors are typeset 
in boldface lower-case (e.g., y or y), matrices in boldface 
upper-case (e.g., X or X), and their transpose is denoted by 
(•)^. E{-} denotes expectation and Cov{-} autocovariance. Ik 
denotes the K x K identity matrix, the fcth column of Ik, 
Ik the length-AT vector of ones, and Diag(6) the diagonal 
matrix created from the vector b. [B]m,n denotes the element 
in the row and column of B, and || • ||f the Erobenius 
norm. Einally, denotes the Kronecker delta sequence, 5{x) 
the Dirac delta distribution, and 1^ the indicator function of 
the event A. 

IT HyGAMP for Multiclass Classification 

In this section, we detail the application of HyGAMP ifTSI to 
multiclass linear classification. In particular, we show that the 
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Fig. 1: Factor graph representations of tn, with white/gray circles 
denoting unobserved/observed random variables, and gray rectangles 
denoting pdf “factors”. 


sum-product algorithm (SPA) variant of HyGAMP is a loopy 
belief propagation (LBP) approximation of the classification- 
error-rate minimizing linear classifier and that the min-sum 
algorithm (MSA) variant is an LBP approach to solving the 
MAP problem ( fTOl i. 


A. Classification via sum-product HyGAMP 


Suppose that we are given M labeled training pairs 
{(2/m,am)}m=i and T test feature vectors as¬ 
sociated with unknown test labels obeying 

the MLR statistical model (Ell-®. Consider the problem of 


computing the classification-error-rate minimizing hypotheses 




giver0 in Algorithm[T] Although in practical MLR applications 
A is not i.i.d. Gaussianf] the numerical results in Section |V] 
suggest that treating it as such works sufficiently well. 

We note from Fig. [T^ that the HyGAMP algorithm is 
applicable to a factor graph with vector-valued variable nodes. 
As such, it generalizes the GAMP algorithm from ll20l . which 
applies only to a factor graph with scalar-variable nodes. Be¬ 
low, we give a brief explanation for the steps in Algorithm [T] 
For those interested in more details, we suggest m for an 
overview and derivation of HyGAMP, ESI for an overview 
and derivation of GAMP, 1^ for rigorous analysis of GAMP 
under large i.i.d. sub-Gaussian A, and 1291301 for fixed-point 
and local-convergence analysis of GAMP under arbitrary A. 

Lines |6]l2] of Algorithm [T] produce an approximation of the 
posterior mean and covariance of X„ at each iteration t. Sim¬ 
ilarly, lines [TSlfTb] produce an approximation of the posterior 
mean and covariance of Z™ = X^a^. The posterior mean 
and covariance of X„ are computed from the intermediate 
quantity r„(f), which behaves like a noisy measurement of 
the true Xn- In particular, for i.i.d. Gaussian A in the large- 
system limit, r„(f) is a typical realization of the random vector 
fn = tCn+V„ with V„ ~ Qh(t))- Thus, the approximate 

posterior pdf used in lines |6]l7] is 


Px\ri^n\fn] Qn) 


f py(x'JAf(x'„;rn, Q'J d< ' 


(15) 


A similar interpretation holds for HyGAMP’s approximation 
of the posterior mean and covariance of Z™ in lines [TSlfTbl 
which uses the intermediate vector (f) and the approximate 
posterior pdf 


under known pyj^ and px, where = [t/i,. •. ,2 /m]^ and 

A = [qi, ..., The probabilities in (fTTl i can be 

computed via the marginalization 

= E /py,x(y,X;A)dX, (13) 

yeytivt) 

with scaling constant label vector y = [i/i,..., 
and constraint set 3^t(y) = {y G s.t. [y]i = 

y and [y]m = ym Vm = 1,...,M}, which fixes the tth 
element of y at the value y and the first M elements of y 
at the values of the corresponding training labels. Due to ® 
and ®, the joint pdf in (fTST l factors as 

M+T N 

Py.x{y,X-,A) = Py\z{ym\X~^ajn) F[px(a:n)- (14) 

m—1 n—1 

The factorization in (fT4l i is depicted by the factor graph in 
Fig- [13 where the random variables {y^} and random vectors 
{x„} are connected to the pdf factors in which they appear. 

Since exact computation of the marginal posterior test-label 
probabilities is an NP-hard problem ll27l . we are interested 
in alternative strategies, such as those based on loopy belief 
propagation by the SPA ED. Although a direct application of 
the SPA is itself intractable when pyj^ takes the MLR form 
®, the SPA simplifies in the large-system limit under i.i.d. 
sub-Gaussian A, leading to the HyGAMP approximation ifTSl 


Pz|y.p(-2m|ymjPmj Qm) 

_ Py|z(2/m|-2m)A/'(-Zm; Pmi Qm) 

JPy\z{ym\z'JAf{z'^;p^,Ql,) dz'^' 

B. Classification via min-sum HyGAMP 

As discussed in Section II-CI an alternative approach to 
linear classification and feature selection is through MAP 
estimation of the true weight matrix X. Given a likelihood 
of the form ® and a prior of the form ®, the MAP estimate 
is the solution to the optimization problem (ITOl i. 

Similar to how the SPA can be used to compute approximate 
marginal posteriors in loopy graphs, the min-sum algorithm 
(MSA) im can be used to compute the MAP estimate. 
Although a direct application of the MSA is intractable when 
Py|z takes the MLR form ®, the MSA simplifies in the large- 
system limit under i.i.d. sub-Gaussian A, leading to the MSA 
form of HyGAMP specified in Algorithm [T] 

As described in Section III-AI when A is large and i.i.d. 
sub-Gaussian, the vector (f) in Algorithm [T] behaves like 
a Gaussian-noise-corrupted observation of the true Xn with 
noise covariance Qn{t). Thus, line [3] can be interpreted as 

^The HyGAMP algorithm in flsl is actually more general than what is 
specified in Algorithm^ but the version in Algorithm^is sufficient to handle 
the factor graph in Fig. \M 

^We note that many of the standard data pre-processing techniques, such as 
z-scoring, tend to make the feature distributions closer to zero-mean Gaussian. 
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Algorithm 1 HyGAMP 


Require: Mode G {SPA, MSA}, matrix A, vector y, pdfs px|r Pz\y,p 
from 1151 - 1161 . initializations r„(0), Q5^(0). 

Ensure: 0; Sm(0)-<—0. 

1: repeat 

if MSA then {for n = 1... N} 

Xn(t) ■«- argmax^ logpx|r(a;„|f„(f-l); Q’r^{t-1)) 

Q* (i) ^[ - -§^ logPx|r(®n(t)|r„(t-l); Q!;(t-1))]"^ 

else if SPA then {for n = 1... N} 

Xn{t') i — E {Xn I Tn = Vn{t — 1); —1)} 

Q* (i) ^ Cov {x„ I r„ = Q’r^{t-1)} 


8 

end if 




9 

Vm : Q^(t) <- AfnQnit) 




10 

Vm : p„,it) •«- AmnXnf) 

- o’* 

(t)s 

m(t-l) 

11 

if MSA then (for m = 1... M} 




12 

Zm{t) -f— argmaxj. logpz|y_p(zm|2/m 

! Pm 

ity,QUt)) 

13 

Qm(i) ^ logPz|y,p(2m 

{t)\ym,P 

rnityQUt))X 

14 

else if SPA then {for m = 1... M] 




15 

^m(i) E {Ztti I Urri', Prn ~ Pm 


16 

Qmf) ^ Cov{z„ ym,Prn = 

Pmf)-, Q 

Ut)} 

17 

end if 




18 

Vm: Q^(t)^[Q^(i)]-l-[Q^ 

it)]- 

^Qlnit)lQUt)]-^ 

19 

Vm : Sm{t) <- [Qm(t)]“^(2m(i) 

~ Pm 

it)) 


20 

Vn: Q’n{t)^[j:XiAlnQUt)X 



21 

Vn : r„(t) <- $„(t) -|- Q!:,(t) 


22 

t i + 1 




23 

until Terminated 





MAP estimation of £c„ and line |4] as measuring the local cur¬ 
vature of the corresponding MAP cost. Similar interpretations 
hold for MAP estimation of Zm via lines fTSlfOl 


C. Implementation of sum-product HyGAMP 

From Algorithm [T] we see that HyGAMP requires inverting 
M-\-N matrices of size DxD (for lines fr8]andl20li in addition 
to solving M-\-N joint inference problems of dimension D in 
lines[3]l7]and fT2lfT6l We now briefly discuss the latter problems 
for the sum-product version of HyGAMP. 

1) Inference of Xn-' One choice of weight-coefficient prior 
Px„ that facilitates row-sparse X and tractable SPA inference 
is Bernoulli-multivariate-Gaussian, i.e., 


Px(Xn) = (1 - P)SiXn) -f PffiXn]0, vl), (17) 


where S{-) denotes the Dirac delta and /3 € (0,1]. In this case, 
it can be shown ED that the mean and variance computations 
in lines |6]l2] of Algorithm [D reduce to 


^ 1 -/? 

j3 M{Q]fn,vI + Q'^„) 

Xu = cf^{i + 

= Cf\l + + (G„ - l)xr,xl 


(18) 

(19) 

( 20 ) 


which requires a D x D matrix inversion at each n. 

2) Inference of Zm-' When py|z takes the MLR form in El- 
closed-form expressions for Zm{t) and Q^{t) from lines 1151 
IT6l of Algorithm [D do not exist. While these computations 
could be approximated using, e.g., numerical integration or im¬ 
portance sampling, this is expensive because Zm{t) and Q^{t) 
must be computed for every index m at every HyGAMP 


iteration t. More details on these approaches will be presented 
in Section IIII-Cl in the context of SHyGAMP. 


D. Implementation of min-sum HyGAMP 

1) Inference of Xn-' To ease the computation of line [3 in 
Algorithm [D it is typical to choose a log-concave prior px 
so that the optimization problem ( fTOl i is concave (since Py|z 
in (ID is also log-concave). As discussed in Section II-CI a 
common example of a log-concave sparsity-promoting prior 
is the Laplace prior Q- In this case, line El becomes 

Xn = argmax-i(a: - rn)^[Qn]~^{x - f„) - A||a;||i, 

X Z 

( 21 ) 

which is essentially the LASSO ED problem. Although (ED 
has no closed-form solution, it can be solved iteratively using, 
e.g., minorization-maximization (MM) ED . 

To maximize a function J{x), MM iterates the recursion 

= argmax J(a:; (22) 


where J{x;x) is a surrogate function that minorizes J{x) 
at X. In other words, J{x;x) < J{x) Mx for any fixed 
X, with equality when x = x. To apply MM to (ED- 
we identify the utility function as Jn{x) = —^{x — 

_ fn) — A||a:||i. Next we apply a result from 
E4l that established that Jn{x) is minorized by x^'^) = 
-^{x-rny[Q'n]-^{x-rn) - ^ {x'^ Aixl^^)x -b Wx^n'^Wl) 
with A(iE) = Diag {|xi|“^,..., Thus (ED implies 

= nrgmax J„(a;; (23) 

X 

= argmaxa;'^[Q;;,]”^f„ - ^x'^{[Qlr^ + XA{x''n^))x 

(24) 

= {[QX^ + XA{xX)-\QlXrn (25) 

where (ED dropped the tc-invariant terms from Jn{x]xX)- 
Note that each iteration k of (ED requires a D x D matrix 
inverse for each n. 

Line |4] of Algorithm [D then says to set equal to the 
Hessian of the objective function in (ED at Xn- Recalling that 
the second derivative of \xnd\ is undefined when Xnd = 0 but 
otherwise equals zero, we set — Qh but then zero the dth 
row and column of for all d such that Xnd = 0. 

2) Inference of z^: Min-sum HyGAMP also requires the 
computation of lines fTDfTD in Algorithm [D In our MLR ap¬ 
plication, line[T2]reduces to the concave optimization problem 

Zn, = avgT!iax-]-{z-X?[QVi~^{z-Pm) 

Z z 

+ l0gPy|z(2/mN)- (26) 


Although (ED can be solved in a variety of ways (see ED for 
MM-based methods), we now describe one based on Newton’s 

method ED- i-e- 


^(fc+i) ^ ^(fc) 




(27) 


where g^m and are the gradient and Hessian of the 

objective function in (ED at and G (0,1] is a 
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stepsize. From Q, it can be seen that ^ logpy|z(t/| 2 ) = 
5y^i -py|z(i| 2 ;), and so 

gW = u{zl^'>) - By^ + - p^), (28) 

where By denotes the yth column of Id and 11 ( 2 :) G is 

defined elementwise as 


[M(2)]i =Py|z(i|2). (29) 

Similarly, it is known ll^ that the Hessian takes the form 

= u(z„^)u(Zm)'^ - Diag{M(2m)} - (30) 


which also provides the answer to line [13] of Algorithm [T| 
Note that each iteration k of (l27l i requires a D x D matrix 
inverse for each m. 

It is possible to circumvent the matrix inversion in (l27l i via 
componentwise update, i.e.. 


-(fc+i) ^ -(fc) 

^md 




md ’ 


(31) 


(k) (k) 

where and are the first and second derivatives of 

the objective function in (l26l l with respect to Zd at z = 

From (l28T l- (l30l l. it follows that 


ffmd =Fy|z(^|2m^) - ^ym-d+ [[Q™] - Pm) 

(32) 

Hml =Fy|z(d|2ijy -Py|z(d|2iJ^) - [[Qm\~^]dd- 


E. HyGAMP summary 

In summary, the SPA and MSA variants of the 
HyGAMP algorithm provide tractable methods of 
approximating the posterior test-label probabilities 
Pytlyi:M(y‘I and computing the MAP weight 

matrix X = argmaxxPy^^j^,x(yi:Mi "4), respectively, 
under a separable likelihood (|2]i and a separable prior 
(|4]i. In particular, HyGAMP attacks the high-dimensional 
inference problems of interest using a sequence of M + N 
low-dimensional (in particular, D-dimensional) inference 
problems and D x D matrix inversions, as detailed in 
Algorithm [T] 

As detailed in the previous subsections, however, these D- 
dimensional inference problems are non-trivial in the sparse 
MLR case, making HyGAMP computationally costly. We refer 
the reader to Table U for a summary of the D-dimensional 
inference problems encountered in running SPA-HyGAMP 
or MSA-HyGAMP, as well as their associated computational 
costs. Thus, in the sequel, we propose a computationally 
efficient simplification of HyGAMP that, as we will see in 
Section lYl compares favorably with existing state-of-the-art 
methods. 


HI. SHyGAMP for Multiclass Classification 

As described in Section |II| a direct application of HyGAMP 
to sparse MLR is computationally costly. Thus, in this section, 
we propose a simplified HyGAMP (SHyGAMP) algorithm 
for sparse MLR, whose complexity is greatly reduced. The 
simplification itself is rather straightforward: we constrain the 


Algorithm 

Quantity 

Method 

Complexity 


X 

CF 

0(jW) 

SPA- 

Q* 

CF 

O(D^) 

HyGAMP 

z 

NI 

O(D^) 



NI 

O(D^) 


X 

MM 

TXKW) 

MSA- 

Q* 

CF 

0{D^) 

HyGAMP 


CWN 

0(KD'^ + D^) 



CF 

0{D^) 


TABLE I: A summary of the D-dimensional inference sub-problems 
encountered when running SPA-HyGAMP or MSA-HyGAMP, as 
well as their associated computational costs. ‘CF’ = ‘closed form’, 
‘NT = ‘numerical integration’, ‘MM’ = ‘minorization-maximization’, 
and ‘CWN’ = ‘component-wise Newton’s method’. For the NI 
method, K denotes the number of samples per dimension, and for 
the MM and CWN methods K denotes the number of iterations. 


covariance matrices Qn^ Ql, Qm’ and to be diagonal. In 
other words, 

Qn = Diag{g;;i,...,g;;£,}, (34) 

and similar for Q^, and Q^. As a consequence, the 

D X D matrix inversions in lines [18] and [20] of Algorithm [T] 
each reduce to D scalar inversions. More importantly, the D- 
dimensional inference problems in lines l3ll7] and [T2][T6l can be 
tackled using much simpler methods than those described in 
Section m as we detail below. 


A. Scalar Variance Approximation 


We further approximate the SHyGAMP algorithm using 
the scalar variance GAMP approximation from na, which 
reduces the memory and complexity of the algorithm. The 
scalar variance approximation first approximates the variances 
{and} ^ value invariant to both n and d, i.e., 

05) 

n—1d—1 

Then, in line [9] in Algorithm [H we use the approximation 


N 


dmd 


n—1 


2 

mn^ 


M 


X A D 

a = r- 


(36) 


The approximation (a), after precomputing || A.|||,, reduces the 
complexity of line [9] from 0{ND) to 0(1). We next define 


S A 

a = 


1 


MD 


M D 

EE 




m—1 d—1 

and in line |2^ we use the approximation 

-1 


M 


^nd 


E ^-9' 


N 


g®||A| 


A r 

T = a- 


(37) 


(38) 


The complexity of line [20] then simplifies from 0{MD) 
to 0(1). For clarity, we note that after applying the scalar 
variance approximation, we have — q^In'^n, and similar 
for Ql^, and 
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B. Sum-product SHyGAMP: Inference of Xn 


With diagonal and Q^, the implementation of lines |6]17] 
is greatly simplified by choosing a sparsifying prior px with the 
separable form px{xn) = Y\’d=iP'f-{^nd)- A common example 
is the Bernoulli-Gaussian (BG) prior 


Px{Xnd) = (1 - Pd)S{Xnd) + Pd^f{x„d; rUd, Vdl). (39) 


For any separable px, lines |6]l7] reduce to computing the mean 
and variance of the distribution 


PY.\r{Xnd\Xnd\ Qnd) 


PxGnd)MGnd;r„d,q'nd) 

f PAXnd)-'^(KdXnd,g'„J dx'^^ ■ 


(40) 


for all n = 1... and d = 1... as in the simpler GAMP 
algorithm Eol. With the BG prior (IWt . these quantities can 
be computed in closed form (see, e.g., ED). 


C. Sum-product SHyGAMP: Inference of Zm 


With diagonal and Q^, the implementation of lines fTSl- 
fThl can also be greatly simplified. Essentially, the problem 
becomes that of computing the scalar means and variances 


Zmd = Cj^ ZdP^/\z{y7n\z)W^^{zk]Pmk,qt,k)'^^ (41) 

k=l 

_ r D ^ 

^md~^m / Z:dP\j\z{ym\z)\\M{z].\Pjnk^(^f)Az — Z^d 

fe=i 


(42) 


for m = 1.. .M and d = 1... D. Here, py|z has the MLR 
form in Q and Cm is a normalizing constant defined as 

A f ^ 

Cm= Py\z{yfn\z) T\Af{zk-,Pmk,qmk)<^^- (43) 

fc=l 

Note that the likelihood py|z is not separable and so inference 
does not decouple across d, as it did in (|40] |. We now describe 
several approaches to computing (l4ni - (l42li . 


1) Numerical integration: A straightforward approach to 
(approximately) computing (EB-dH is through numerical 
integration (Nl). For this, we propose to use a hyper- 
rectangular grid of 2 : values where, for Zd, the interval 


Pmd dmd^ Pmd 4“ 



is sampled at K equi- 


spaced points. Because a D-dimensional numerical integral 
must be computed for each index m and d, the complexity 
of this approach grows as 0{MDK^), making it impractical 
unless D, the number of classes, is very small. 


2) Importance sampling: An alternative approximation of 
(I4n i-(|4^ can be obtained through importance sampling (IS) ||9] 
§11.1.4]. Here, we draw K independent samples {zm[k]}^^i 


from M{Pm, Qm) ^nd compute 

K 

Cm^^Py\z{ym\Zm[k]) (44) 

k^l 

K 

^md ^md[^]Py\z{ym \^m [^]) (^ 5 ) 

k^l 
K 

dmd^^m Z:md[k]Py\z{ym\Zm[k]) — Zmd (46) 

k=l 

for all m and d. The complexity of this approach grows as 
0{MDK). 


3) Taylor-series approximation: Another approach is to 
approximate the likelihood py|z using a second-order Taylor 
series (TS) about i.e., Py\z{ym\z) « fm{z;Pm) with 

!m{z\Pm) = Py\z{ym\Pm) + 9m^m^ -p^) 

+ \iz-Pm)^Hm{Pm)iz:-Pm) ( 47 ) 
for gradient g^(p) = ^Py|z(2/mN)| and Hessian 

2 Z—p 

Hm{p) — '§^Py\z{.ym\z)\ In this case, it can be shown 
ll3T]l that 


I D 

Cm ~ fmiPm) + 2 ^ ^ ^ImkiPm)^, 


(48) 




Zmd ~ Cm ^fmiPm) Pmd 9md{Pm)Qmd 

+ n 'y ^ Pmkdmk^mkjPm) 1 


(49) 


dmd ~ Cm [ fm{Pm) (.Pmd 4~ dmd) 4” ‘^9md(Pm)Pmddmd 


+ 


2^md(Pmd 4 ” ^9md} ^^d,(Pm) 


4" r) (Pmd 4" 9md) C^mdiPm) II I - ^md, (50) 

k^d / 

fere Hmd(p) = [Hm(p)]dd- The complexity of this ap- 


4} Gaussian mixture approximation: It is known that the 
logistic cdf 1/(1 -b exp(—cc)) is well approximated by a 
mixture of a few Gaussian cdfs, which leads to an efficient 
method of approximating dHli-dlill in the case of binary 
logistic regression (i.e., D = 2) 1^ . We now develop an 
extension of this method for the MLR case (i.e., D > 2). 

To facilitate the Gaussian mixture (GM) approximation, we 
work with the difference variables 

(y) ^{zy-Zd dfy 
\zy d = y 


( 51 ) 
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Their utility can be seen from the fact that (recalling Q) 

1 


fy|z(yN) = Y 


+ Ed/y exp(zd - Zy) 

1 


(52) 


l + Ed/yexp(-7(*'^) 


A /(y)(^(y))^ (53) 


which is smooth, positive, and bounded by 1, and 
strictly increasing in ThusQ for appropriately chosen 

{Od, HkhCTkl}, 


1=1 k^y 


Ik — ^-kl 
Ckkl 




(54) 


where $(a;) is the standard normal cdf, aki > 0, a; > 0, 
and = 1- In practice, the GM parameters {«;, /tfez, crfej} 

could be designed off-line to minimize, e.g., the total variation 
distance sup^^ggc 

Recall from (ITTIi - d?^ that our objective is to compute 
quantities of the form 

/ {e\zyPy\z{y\z)M{z\p,Q'^)dz = S^y^, (55) 

dRr> 

where i £ {0,1, 2}, is diagonal, and is the dth column 
of Id- To exploit (l54l i. we change the integration variable to 

= TyZ (56) 


with 


Ty = 


(57) 


l(y-l)xl 0(y-l)x(n-y) 

Olx(i/—1) 1 Olx(-D—y) 

_0(_D-y)x(y-l) l(D-y)xl —lo-y 

to get (since det(Tj^) = 1) 

= / {elT-^-fyi(y\'j)M{r,TyP,TyQ^T'l)dj. 

dRD 

(58) 

Then, applying the approximation (l54l i and 


Afir^Typ.TyQ^T]) =Af{jy-py,qP) 


X J\f{jk ; ly - Pk, ql) (59) 

k^y 


to (158b . we find that 

L 

^iy) ^ 

1=1 






Y[-^{lk;7y-Pk,ql)^ 


k^y 


Noting that T ^ = Ty, we have 


Ik — Pkl 
<^kl 


d7fc 


d7y. 

(60) 




ly-Id d^y 

ly d=y 


(61) 


^Note that, since the role of y in is merely to ignore the yth 

component of the input -y, we could have instead written 
for ^/-invariant /(•) and Jy constructed by removing the yth row from the 
identity matrix. 


Thus, for a fixed value of 7 y = c, the inner integral in (l60b 
can be expressed as a product of linear combinations of terms 



7-p 


d7 = T, 


(62) 


with i £ {0,1, 2}, which can be computed in closed form. In 

particular, defining x = , we have 

V‘^^+9 


To = $(1) 


Ti 


(c - p)<i)(x) -I- 


q^x) 

V'^'^ + q 


(Ti)^ 

$(x) 


q^{x) 


q^(t>{x) ( ^{x) \ 

0-2 + g V <I>(x) / ’ 


(63) 

(64) 

(65) 


which can be obtained using the results in ll^ §3.9]. The 
outer integral in (l60b can then be approximated via numerical 
integration. 

If a grid of K values is used for numerical integration over 
7 j, in ( l60b . then the overall complexity of the method grows 
as 0{MDLK). Our experiments indicate that relatively small 
values (e.g., L = 2 and K = 7) suffice. 

5) Performance comparison: Above we described four 
methods of approximating lines [TSlfTbl in Algorithm [T] under 
diagonal and Q^. We now compare the accuracy and 
complexity of these methods. In particular, we measured the 
accuracy of the conditional mean (i.e., linefTSb approximation 
as follows (for a given p and Q^): 

1 ) generate i.i.d. samples 2 :true[f] ^ ■J^{z',p,Q^) and 
2 /true [f] ~ Py\z{y I 2 :true[f]) for f = 1 . . . T, 

2) compute the approximation 'z\f\ ~ E{z | y = 

2 /true [i]jP = using each method described in 

Sections IIII-ClIInT^ 

3) compute average MSE = ^ EtLi ||^true[f] — ^W ||2 I®*" 
each method, 

and we measured the combined runtime of lines [TSlfTbl for each 
method. Unless otherwise noted, we used 79 = 4 classes, p = 
ei, = q^Io, and < 7 ^ = 1 in our experiments. For numerical 
integration (NI), we used a grid of size K = 7 and radius of 
a = A standard deviations; for importance sampling (IS), we 
used K = 1500 samples; and for the Gaussian-mixture (GM) 
method, we used L = 2 mixture components and a grid size 
of AT = 7. Empirically, we found that smaller grids or fewer 
samples compromised accuracy, whereas larger grids or more 
samples compromised runtime. 

Figure |2] plots the normalized MSE versus variance 
for the four methods under test, in addition to the trivial 
method ^\t\ = p. The figure shows that the NI, IS, and 
GM methods performed similarly across the full range of 
gP and always outperform the trivial method. The Taylor- 
series method, however, breaks down when qP > 1. A close 
examination of the figure reveals that GM gave the best 
accuracy, IS the second best accuracy, and NI the third best 
accuracy. 

Figure [3] shows the cumulative runtime (over M = 500 
training samples) of the methods from Sections IIII-ClIiriI-C4l 
versus the number of classes, 79. Although the Taylor-series 
method was the fastest, we saw in Fig.|2]that it is accurate only 
at small variances gP. Figure [3 then shows GM was about an 



























Fig. 2: MSE/g*’ versus variance for various methods to compute 
line [B] in Algorithm [T] Each point represents the average of 5 x 10® 
independent trials. 


Algorithm 

Quantity 

Method 

Complexity 


x 

CF 


SPA- 

Q* 

CF 

0(D) 

SHyGAMP 

z 

GM 

O(LKD) 



GM 

O(LKD) 


X 

ST 


MSA- 

Q* 

CF 

0(D) 

SHyGAMP 

z 

CWN 

0(KD) 



CF 

0(D3) 


TABLE II: A summary of the D-dimensional inference sub-problems 
encountered when running SPA-SHyGAMP or MSA-SHyGAMP, as 
well as their associated computational costs. ‘CP’ = ‘closed form’, 
‘GM’ = ‘Gaussian mixture’, ‘ST’ = ‘Soft-thresholding’, and ‘CWN’ 
= ‘component-wise Newton’s method’. For the GM, L denotes the 
number of mixture components and K the number of samples in 
the ID numerical integral, and for CWN K denotes the number of 
iterations. 


which is known to have the closed-form “soft thresholding” 
solution 

Xnd = sgn(f„d) max{0, |f„d| - Mnd)- (67) 



Fig. 3; Cumulative runtime (over M = 500 samples) versus number- 
of-classes D for various methods to compute lines rBTT^ in Algo- 
rithm[T] Each point represents the average of 2000 independent trials. 


order-of-magnitude faster than IS, which was several orders- 
of-magnitude faster than NI. 

Together, Figures |2K3 show that our proposed GM method 
dominated the IS and NI methods in both accuracy and 
runtime. Thus, for the remainder of the paper, we imple¬ 
ment sum-product SHyGAMP using the GM method from 
Section IIII-C4I 


Above, sgn(r) = 1 when r > 0 and sgn('r) = — 1 when r < 0. 
Meanwhile, line |4] reduces to 


X 

^nd 


dx'^ 


f 1 

V2 C 




-1 


( 68 ) 


which equals when Xnd ^ 0 and is otherwise undefined. 
When Xnd = 0, we set = 0. 


E. Min-sum SHyGAMP: Inference of Zm 

With diagonal and the implementation of lines fT^ 
IT 3 ] in Algorithm [T] also simplifies. Recall that, when the 
likelihood takes the MLR form in (O, line [12] manifests as 
(l26l l. which can be solved using a component-wise Newton’s 
method as in (llB-lESj) for any and When is 
diagonal, the first and second derivatives (l32] |- (l3Tl l reduce to 

= Py\z{d\zl^'>) - Sy^-d + {^,^1 - Pmd)/(Ld- ( 69 ) 
- Py\zid\z^r^'>) - (70) 


which leads to a reduction in complexity. 

Furthermore, line [T3] simplifies, since with diagonal it 
suffices to compute only the diagonal components of in 
(l30t . In particular, when is diagonal, the result becomes 

^ (71) 


9md 


^hmd + P'i\z{d\Zm) - P^/\z{d\Zmf 


D. Min-suin SHyGAMP: Inference of Xn 

With diagonal and Q^, the implementation of lines |3|4| 
in Algorithm [T] can be significantly simplified. Recall that, 
when the prior px is chosen as i.i.d. Laplace (|5]l, line |3 
manifests as (EB, which is in general a non-trivial optimiza¬ 
tion problem. But with diagonal Ql^, (ItTT i decouples into D 
instances of the scalar optimization 


1 (x 

Xnd = argmax-- — 

X Z 


- Tnd)'^ 
<lnd 


A|a;|, 


( 66 ) 


F. SHyGAMP summary 

In summary, by approximating the covariance matrices as 
diagonal, the SPA-SHyGAMP and MSA-SHyGAMP algo¬ 
rithms improve computationally upon their HyGAMP coun¬ 
terparts. A summary of the D-dimensional inference prob¬ 
lems encountered when running SPA-SHyGAMP or MSA- 
SHyGAMP, as well as their associated computational costs, is 
given in Table |II| A high-level comparison between HyGAMP 
and SHyGAMP is given in Table |ni| 
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Algorithm 

HyGAMP 

SHyGAMP 

Diagonal covariance matrices 


/ 

Simplified D-dimensional inference 


/ 

Scalar-vaiiance approximation 


/ 

Online parameter tuning 


/ 


TABLE III: High-level comparison of SHyGAMP and HyGAMP. 


IV. Online Parameter Tuning 

The weight vector priors in (|5]l and (|39] | depend on modeling 
parameters that, in practice, must be tuned. Although cross- 
validation (CV) is the customary approach to tuning such 
model parameters, it can be very computationally costly, since 
each parameter must be tested over a grid of hypothesized 
values and over multiple data folds. For example, iT-fold 
cross-validation tuning of P parameters using G hypothesized 
values of each parameter requires the training and evaluation 
of KG^ classifiers. 

A. Parameter selection for Sum-product SHyGAMP 

For SPA-SHyGAMP, we propose to use the zero-mean 
Bernoulli-Gaussian prior in (IWt . which has parameters jSd, 
rrid, and Vd- Instead of CV, we use the EM-GM-AMP frame¬ 
work described in ll25l to tune these parameters online. See 
on for details regarding the initialization of Pd, rrid, and Vd- 


B. Parameter selection for Min-sum SHyGAMP 

To use MSA-SHyGAMP with the Laplacian prior in (|5]l, 
we need to specify the scale parameter A. For this, we 
use a modification of the SURE-AMP framework from 12^ . 
which adjusts A to minimize the Stein’s unbiased risk estimate 
(SURE) of the weight-vector MSE. 

We describe our method by first reviewing SURE and 
SURE-AMP. Eirst, suppose that the goal is to estimate the 
value of X, which is a realization of the random variable X, 
from the noisy observation r, which is a realization of 

r = X + y/7w, (72) 

with W ^ AA(0,1) and > 0. Eor this purpose, consider 
an estimate of the form x = f(r,q^\ 6) where 6 contains 
tunable parameters. Eor convenience, define the shifted esti¬ 
mation function g{r, q^]6) = f{r, q^\6) — r and its derivative 
g'{r,q'^;6) = ■^g{r^q^]9). Then Stein BOl established the 
following result on the mean-squared error, or risk, of the 
estimate x\ 

P{[X-X]^) =q^ + P{g^(r,q^-e)+2q^g'{r,q^-e)]. (73) 

The implication of (l73t is that, given only the noisy observa¬ 
tion r and the noise variance one can compute an estimate 

SURE(r, (?■•; 6) ^ q^ + g^ (r, 0) + 2q^g'{r, q^; 6) (74) 

of the MSE(0) = E {[x — x]^} that is unbiased, i.e., 

E{SURE(r,g'';e)} =MSE(6>). (75) 

These unbiased risk estimates can then be used as a surrogate 
for the true MSE when tuning 9. 


In Il26l . it was noticed that the assumption (l72l i is satisfied 
by AMP’s denoiser inputs and thus 12^ proposed 

to tune the soft threshold A to minimize the SURE: 

N 

A = argmin^p^(f„,g'';A) -f 2g''p'(f„, g""; A). (76) 

Recalling the form of the estimator /(•) from (l67l l. we have 


g^(f„,g''; A) 
g'(f„,g''; A) 


A^(g'')2 if|f„|>Ag'' 
otherwise 

-1 if |f„| < Ag'' 

0 otherwise 


(77) 

(78) 


However, solving (l76l l for A is non-trivial because the objective 
is non-smooth and has many local minima. A stochastic gradi¬ 
ent descent approach was proposed in ll2^ . but its convergence 
speed is too slow to be practical. 

Since (ITTT i also matches the scalar-variance SHyGAMP 
model from Section IIII-AI we propose to use SURE to tune A 
for min-sum SHyGAMP. But, instead of the empirical average 
in GUI, we propose to use a statistical average, i.e.. 


A = argminE{g2(r,g''; A) -f 2gV(r, g''; A)}, 

A '-„-' 

= JW 


(79) 


by modeling the random variable r as a Gaussian mixture 
(GM) whose parameters are fitted to {fnd}- As a result, 
the objective in (l79l l is smooth. Moreover, by constraining 
the smallest mixture variance to be at least g'', the objective 
becomes unimodal, in which case A from (|79]l is the unique 
root of ^ J(A). To find this root, we use the bisection method. 
In particular, due to (l77l) - (l78l) . the objective in (f79b becomes 

J(A) = f pr{r)X^{q'^)'^ dr + f p,{r){r'^ - 2q'^) dr 

J —oo J —Ag*” 

nOO 

+ / pr{r)X^{q'^fdr, (80) 

JXq^ 

from which it can be shown that m 

Aj(A) = 2A(g^)2[l - Pr{-Ag^ < r < Ag--}] 

- [pr(Ag'')+pr(-Ag'')]2(g'')^. (81) 

Eor GM fitting, we use the standard EM approach Q and find 
that relatively few (e.g., L = 3) mixture terms suffice. Note 
that we re-tune A using the above technique at each iteration 
of Algorithm [T] immediately before line 0 Experimental 
verification of our method is provided in Section IV-BI 


V. Numerical Results 

In this section we describe the results of several experiments 
used to test SHyGAMP. In these experiments, EM-tuned SPA- 
SHyGAMP and SURE-tuned MSA-SHyGAMP were com¬ 
pared to two state-of-the-art sparse MLR algorithms: SBMLR 
01 and GLMNET Q3. We are particularly interested in 
SBMLR and GLMNET because II13I14II show that they have 
strong advantages over earlier algorithms, e.g., l|io[n[i2|. 
As described in Section II-CI both SBMLR and GLMNET 
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use £i regularization, but SBMLR tunes the regularization 
parameter A using evidence maximization while GLMNET 
tunes it using cross-validation (using the default value of 10 
folds unless otherwise noted). For SBMLR and GLMNET, 
we ran code written by the author^ under default settings 
(unless otherwise noted). For SHyGAMP, we used the damp¬ 
ing modihcation described in ll30l . We note that the runtimes 
reported for all algorithms include the total time spent to tune 
all parameters and train the hnal classiher. 

Due to space limitations, we do not show the performance 
of the more complicated HyGAMP algorithm from Section HU 
However, our ej^erience suggests that HyGAMP generates 
weight matrices X that are very similar to those generated by 
SHyGAMP, but with much longer runtimes, especially as D 
grows. 

A. Synthetic data in the M N regime 

We hrst describe the results of three experiments with 
synthetic data. For these experiments, the training data were 
randomly generated and algorithm performance was averaged 
over several data realizations. In all cases, we started with 
balanced training labels G {1; • ■ ■; D} for m = 1,..., M 
(i.e., M/D examples from each of D classes). Then, for 
each data realization, we generated M i.i.d. training fea¬ 
tures Gm from the class-conditional generative distribution 
ttm It/m In doing so, we chose the intra¬ 

class variance, v, to attain a desired Bayes error rate (BER) of 
10% (see on for details), and we used randomly generated K- 
sparse orthonormal class means, G R.^. In particular, we 
generated ..., fijy] by drawing a K x K matrix with i.i.d. 
Af{0, 1) entries, performing a singular value decomposition, 
and zero-padding the hrst D left singular vectors to length N. 
We note that our generation of y, A, X is matched lim to the 
multinomial logistic model (|2]i-(|2l. 

Given a training data realization, each algorithm was in¬ 
voked to yield a weight matrix X = \xi,...^xo\. The 
corresponding expected test-error rate was then analytically 
computed as 

1 ^ 

Pr{err} = ^ - jjYl Pricorl//} (82) 

y^l 

Pr{coii?/} = Pr Pi |(®j^ - Xd^a < (xy - , (83) 

d=^V 

where a ^ A/^(0, vl m) and the multivariate normal cdf in (l8^ 
was computed using Matlab’s mvncdf. 

For all three synthetic-data experiments, we used D = A 
classes and K M N. In the hrst experiment, we hxed 
K and N and we varied M ; in the second experiment, we hxed 
K and M and we varied AT; and in the third experiment, we 
hxed K and M and we varied N. The specihc values/ranges 
of K, M, N used for each experiment are given in Table |IV] 
Figures |4^-b show the expected test-error rate and runtime, 
respectively, versus the number of training examples, M, 
averaged over 12 independent trials. Figure |4^ shows that, 

^SBMLR obtained from http://theoval.cmp.uea.ac.uk/matlab/ 

^GLMNET obtained from http://www.stanford.edu/'^hastie/glmnet_matlab/ 


Experiment 

M 

N 

K 

D 

1 

{too, . . . ,5000} 

10000 

10 

4 

2 

300 

30000 

{5, ... ,30} 

4 

3 

200 

{10^, .... 

10 

4 

4 

300 

30000 

25 

4 


TABLE IV: Conhgurations of the synthetic-data experiments. 


at all tested values of M, SPA-SHyGAMP gave the best 
error-rates and MSA-SHyGAMP gave the second best error- 
rates, although those reached by GLMNET were similar at 
large M. Moreover, the error-rates of SPA-SHyGAMP, MSA- 
SHyGAMP, and GLMNET all converged towards the BER 
as M increased, whereas that of SBMLR did not. Since 
MSA-SHyGAMP, GLMNET, and SBMLR all solve the same 

-regularized MLR problem, the difference in their error- 
rates can be attributed to the difference in their tuning of 
the regularization parameter A. Figure |4]r shows that, for 
M > 500, SPA-SHyGAMP was the fastest, followed by MSA- 
SHyGAMP, SBMLR, and GLMNET. Note that the runtimes of 
SPA-SHyGAMP, MSA-SHyGAMP, and GLMNET increased 
linearly with M, whereas the runtime of SBMLR increased 
quadratically with M. 

Figures |5^-b show the expected test-error rate and runtime, 
respectively, versus feature-vector sparsity, K, averaged over 
12 independent trials. Figure|5t shows that, at all tested values 
of K, SPA-SHyGAMP gave the best error-rates and MSA- 
SHyGAMP gave the second best error-rates. Figure [Sj? shows 
that SPA-SHyGAMP and MSA-SHyGAMP gave the fastest 
runtimes. All runtimes were approximately invariant to K. 

Figures |6^-b show the expected test-error rate and runtime, 
respectively, versus the number of features, N, averaged over 
12 independent trials. Figure |6^ shows that, at all tested values 
of N, MSA-SHyGAMP gave lower eiTor-rates than SBMLR 
and GLMNET. Meanwhile, SPA-SHyGAMP gave the lowest 
error-rates for certain values of N. Figure [bj) shows that SPA- 
SHyGAMP and MSA-SHyGAMP gave the fastest runtimes 
for N > 10000, while SBMLR gave the fastest runtimes for 
N < 3000. All runtimes increased linearly with N. 


B. Example of SURE tuning 

Although the good error-rate performance of MSA- 
SHyGAMP in Section FV-AI suggests that the SURE A-tuning 
method from Section IIV-BI is working reliably, we now de¬ 
scribe a more direct test of its behavior. Using synthetic data 
generated as described in Section IV-AI with D = A classes, 
N = 30000 features, M = 300 examples, and sparsity 
K = 25, we ran MSA-SHyGAMP using various hxed values 
of A. In the sequel, we refer to this experiment as “Synthetic 
Experiment 4.” The resulting expected test-error rate versus 
A (averaged over 10 independent realizations) is shown in 
Fig .Q For the same realizations, we ran MSE-SHyGAMP with 
SURE-tuning and plot the resulting error-rate and average A in 
Fig. Ill From Fig. |7] we see that the SURE A-tuning method 
matched both the minimizer and the minimum of the error- 
versus-A trace of hxed-A MSA-SHyGAMP. 
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(a) Error 



(b) Runtime 

Fig. 4: Synthetic Experiment 1: expected test-error rate and runtime 
versus M. Here, D = A, N = 10000, and K = 10. 


C. Micro-array gene expression 

Next we consider classification and feature-selection using 
micro-array gene expression data. Here, the labels indicate 
which type of disease is present (or no disease) and the 
features represent gene expression levels. The objective is i) 
to determine which subset of genes best predicts the various 
diseases and ii) to classify whether an (undiagnosed) patient 
is at risk for any of these diseases based on their gene profile. 

We tried two datasets: one from Sun et al. jTl and one 
from Bhattacharjee et al. ||2|- The Sun dataset includes M = 
179 examples, N = 54613 features, and D = A classes; 
and the Bhattacharjee dataset includes M = 203 examples, 
N = 12600 features, and D = 5 classes. With the Sun 
dataset, we applied a log 2 (-) transformation and z-scored prior 
to processing, while with Bhattacharjee we simply z-scored 
(since the dataset included negative values). 

The test-error rate was estimated as follows for each dataset. 
We consider a total of T “trials.” For the fth trial, we i) 
partition the dataset into a training subset of size Mtrain.t and 



(a) Error 



(b) Runtime 

Fig. 5: Synthetic Experiment 2: expected test-error rate and runtime 
versus K. Here, D = 4, M = 300, and N = 30000. 


a test subset of size Mtest.t, ii) design the classifier using the 
training subset, and iii) apply the classifier to the test subset, 
recording the test errors {etm}m=i*’ where etm G {0,1} 
indicates whether the mth example was in error. We then 
estimate the average test-error rate using the empirical average 

M = -^testSt=i Sm=r where iWtest = -^test.t- If 

the test sets are constructed without overlap, we can model 
{etm} as i.i.d. Bernoulli(/j,), where p denotes the true test- 
error rate. Then, since p is Binomial(/i, Mtest), the standard 
deviation (SD) of our etTor-rate estimate p is i/var}/!} = 
•\//r(l — /r)/Mtest- Since /i is unknown, we approximate the 
SD by A/}t(l - ^)/Mtest- 

Tables |V] and |Vl] show, for each algorithm, the test-error 
rate estimate }t, the approximate SD of the 

estimate, the average runtime, and two metrics for the sparsity 
of X. The ||-Xi||o metric quantifies the number of non-zero 
entries in X (i.e., absolute sparsi^, while the iTgg metric 
quantifies the number of entries of X needed to reach 99% of 
the Frobenius norm of X (i.e., effective sparsity). We note that 
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(a) Error 



(b) Runtime 

Fig. 6: Synthetic Experiment 3: expected test-error rate and runtime 
versus N. Here, D = A, M = 200, and K = 10. 



Fig. 7: Synthetic experiment 4: expected test-error rate versus regu¬ 
larization parameter A for fixed-A MSA-SHyGAMR Here, D = 4, 
M = 300, N = 30000, and K = 25. Also shown is the average test- 
error rate for SURE-tuned MSA-SHyGAMP plotted at the average 
value of A. 


Algorithm 

% Error (SD) 

Runtime (s) 

K 99 

ir^'iio 

SPA-SHyGAMP 

33.3 (3.8) 

6.86 

20.05 

218452 

MSA-SHyGAMP 

31.0 (3.7) 

13.59 

93.00 

145.32 

SBMLR 

31.6 (3.7) 

22.48 

49.89 

72.89 

GLMNET 

33.9 (3.8) 

31.93 

10.89 

16.84 


TABLE V: Estimated test-error rate, standard deviation of estimate, 
runtime, and sparsities for the Sun dataset. 


However, SPA-SHyGAMP ran about twice as fast as SBMLR, 
and 4x as fast as GLMNET. As in the Sun dataset, SPA- 
SHyGAMP returned the sparsest weight matrix according to 
the ATgg metric. The sparsities of the weight matrices returned 
by the other three algorithms were similar to one another in 
both metrics. Unlike in the Sun dataset, MSA-SHyGAMP and 
SBMLR had similar runtimes (which is consistent with Fig. |6}2 
since N is lower here than in the Sun dataset). 


the reported values of ATgg and ||X||o represent the average 
over the T folds. For both the Sun and Bhattacharjee datasets, 
we used T = 19 trials and = [M/20J Vf. 

Table |V] shows results for the Sun dataset. There we see 
that MSA-SHyGAMP gave the best test-error rate, although 
the other algorithms were not far behind and all error-rate 
estimates were within the estimator standard deviation. SPA- 
SHyGAMP was the fastest algorithm and MSA-SHyGAMP 
was the second fastest, with the remaining algorithms running 
3x to 5x slower than SPA-SHyGAMP. GLMNET’s weights 
were the sparsest according to both sparsity metrics. SPA- 
SHyGAMP’s weights had the second lowest value of Kgg, 
even though they were technically non-sparse (i.e., ||X||o = 
218 452 = ND) as expected. Meanwhile, MSA-SHyGAMP’s 
weights were the least sparse according to the Kgg metric. 

Table rvTl shows results for the Bhattacharjee dataset. In this 
experiment, SPA-SHyGAMP and SBMLR were tied for the 
best error rate, MSA-SHyGAMP was 0.5 standard-deviations 
worse, and GLMNET was 1.2 standard-deviations worse. 


D. Text classification with the RCVl dataset 

Next we consider text classification using the Reuter’s 
Corpus Volume 1 (RCVl) dataset im. Here, each sample 
iUm, o-m) represents a news article, where ym indicates the 
article’s topic and a^n indicates the frequencies of common 
words in the article. The version of the dataset that we usec0 
contained = 47 236 features and 53 topics. However, we 
used only the first Z? = 25 of these topics (to reduce the 
computational demand). Also, we retained the default training 
and test partitions, which resulted in the use of M = 14 147 
samples for training and 469 571 samples for testing. 

The RCVl features are very sparse (only 0.326% of the 
features are non-zero) and non-negative, which conflicts with 
the standard assumptions used for the derivation of AMP 
algorithms; that A is i.i.d. zero-mean and sub-Gaussian. Inter¬ 
estingly, the RCV1 dataset also caused difficulties for SBMLR, 
which diverged under default settings. This divergence was 

' http://www.csie.ntu.edu.t\v/~cjlin/libsvmtools/datasets/multiclass.html 
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Algorithm 

% Error (SD) 

Runtime (s) 

iTgg 

ll^'llo 

SPA-SHyGAMP 

9.5 (2.1) 

3.26 

16.15 

63 000 

MSA-SHyGAMP 

10.5 (2.2) 

6.11 

55.20 

84.65 

SBMLR 

9.5 (2.1) 

6.65 

44.25 

79.10 

GLMNET 

12.0 (2.4) 

13.67 

49.65 

89.40 


TABLE VI: Estimated test-error rate, standard deviation of estimate, 
runtime, and sparsities for the Bhattacharjee dataset. 



Cumulative Runtime [sec] 


Eig. 8: Test-error rate versus runtime for the RCVl dataset. 

remedied by decreasing the value of a step-size paramete]0 to 
0.1 from the default value of 1. 

Figure [8] shows test-error rate versus runtime for SPA- 
SHyGAMP, MSA-SHyGAMP, SBMLR, and GLMNET on 
the RCVl dataset. In the case of SPA-SHyGAMP, MSA- 
SHyGAMP and SBMLR, each point in the figure represents 
one iteration of the corresponding algorithm. For GLMNET, 
each data-point represents one iteration of the algorithm 
after its cross-validation stage has completed]^ We used 2 
CV folds (rather than the default 10) in this experiment 
to avoid excessively long runtimes. The figure shows that 
the SHyGAMP algorithms converged more than an order-of- 
magnitude faster than SBMLR and GLMNET, although the 
final error rates were similar. SPA-SHyGAMP displayed faster 
initial convergence, but MSA-SHyGAMP eventually caught 
up. 

E. MNIST handwritten digit recognition 

Finally, we consider handwritten digit recognition using 
the Mixed National Institute of Standards and Technology 
(MNIST) dataset ll42l . This dataset consists of 70 000 exam¬ 
ples, where each example is an = 784 pixel image of 
one of I? = 10 digits between 0 and 9. These features were 
again non-negative, which conflicts with the standard AMP 
assumption of i.i.d. zero-mean A. 

Our experiment characterized test-error rate versus the num¬ 
ber of training examples, M, for the SPA-SHyGAMP, MSA- 
SHyGAMP, SBMLR, and GLMNET algorithms. For each 

*See the variable scale on lines 129 and 143 of sbmlr.m. 

^GLMNET spent most of its time on cross-validation. After cross- 
validation, GLMNET took 25.26 seconds to run, which is similar to the total 
runtimes of SPA-SHyGAMP and MSE-SHyGAMP. 



Number of Training Samples M 


Fig. 9: Estimated test-error rate versus M for the MNIST dataset, 
with error-bars indicating the standard deviation of the estimate. 


value of M, we performed 50 Monte-Carlo trials. In each trial, 
M training samples were selected uniformly at random and the 
remainder of the data were used for testing. Figure |9] shows 
the average estimated test-error rate fi versus the number of 
training samples, M, for the algorithms under test. The error- 
bars in the figure correspond to the average of the per-trial 
estimated SD over the 50 trials. For SBMLR, we reduced 
the stepsize to 0.5 from the default value of 1 to prevent 
a significant degradation of test-error rate. The figure shows 
SPA-SHyGAMP attaining significantly better error-rates than 
the other algorithms at small values of M (and again at the 
largest value of M considered for the plot). For this plot, M 
was chosen to focus on the M < N regime. 


VI. Conclusion 

For the problem of multi-class linear classification and fea¬ 
ture selection, we proposed several AMP-based approaches to 
sparse multinomial logistic regression. We started by propos¬ 
ing two algorithms based on HyGAMP im, one of which 
finds the maximum a posteriori (MAP) linear classifier based 
on the multinomial logistic likelihood and a Laplacian prior, 
and the other of which finds an approximation of the test- 
error-rate minimizing linear classifier based on the multino¬ 
mial logistic likelihood and a Bernoulli-Gaussian prior. The 
numerical implementation of these algorithms is challenged, 
however, by the need to solve iT-dimensional inference prob¬ 
lems of multiplicity M at each HyGAMP iteration. Thus, we 
proposed simplified HyGAMP (SHyGAMP) approximations 
based on a diagonalization of the message covariances and a 
careful treatment of the H-dimensional inference problems. In 
addition, we described EM- and SURE-based methods to tune 
the hyperparameters of the assumed statistical model. Finally, 
using both synthetic and real-world datasets, we demonstrated 
improved error-rate and runtime performance relative to the 
state-of-the-art SBMLR lIT^ and GLMNET llT4l approaches 
to sparse MLR. 
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